Introduction

We chose the NYC Open Dataset for NYPD Motor Vehicle Collisions. This is an open source dataset updated on a daily basis that tracks all road accidents in New York City. We want to use this dataset to infer trends about the various factors that cause accidents. A key question we wish to answer is Can we provide insights to help make the streets of NYC safer? Understanding the various reasons for accidents can certainly help us answer this question. For example, we want to identify the “hotspots” - intersections, streets etc. where most accidents occur. This could be a useful insight for authorities to take to necessary actions to remediate some of the issues.

In addition to this report, we have 2 interactive plots:

The github repo for the project is here.

Questions of Interest

  1. Location
  • How are the accidents distributed across boroughs?
  • Which locations (intersections, streets) are the hotspots for accidents?
  • Why do some zip codes have more accidents?
  1. Time:
  • What time of the day is prone to accidents?
  • How has accident frequency and deaths changed over years?
  • How does the distribution of accidents vary across months?
  1. Accident Information:
  • What are the main factors contributing to accidents?
  • What are the types of Vehicles that cause more accidents?
  • What are the causes of accidents in different vehicle types?
  • Which mode of transportation is more likely to get involved in an accident?
  • What are the main reasons behind accidents at hotspots (ex. inattentive driver)?
  1. Potpourri:
  • What has been going on around Columbia University campus?
  • Any intersections around Columbia we need to be wary of?
  • Am I safer if I walk on Amsterdam instead of Broadway?
  • Which neigborhoods have high accident rates?

Team Members and Contributions

  1. Aakanksha Joshi - aj2771 The CTO and Editorial Director. Created the D3 interactive chart for Accidents by Borough, contributed to the Location analysis, Vision Zero analysis and executive summary.

  2. Adarsh Chavakula - ac4244 The CEO and (Tidyverse) Advisory Committee. Contributed to the data quality analysis, Columbia neighborhood analysis and compiled the project report.

  3. Deepak Maran - dm3308 Board Executive and Chief Design Officer. Created the D3 interactive chart for Contributing Factors over Time, contributed to the Time Analysis and executive summary.

  4. Harsheel Singh Soin - hss2148 The COO and Design Insights Lead. Contributed to the Accident Information analysis, location analysis, heatmaps, choropleths, the Vision Zero analysis and executive summary.

Description of the Data

The data was collected from the NYC Open Data website. It provides information on the exact location, borough, time, cause, type of vehicles, number of deaths and injuries for every accident. The data is from third quarter of 2012 to the first quarter of 2018. This data is collected by the police department (NYPD) because the NYC Council passed a local law in 2011. The data is manually run every month and reviewed by the TrafficStat unit before being posted on the NYPD website.

Each record represents a collision in NYC by borough, precinct and cross street. The data can potentially provide insights into how dangerous or safe certain intersections in NYC are. The information is available in both pdf and csv formats.

We downloaded the CSV file from the NYC Open Data webpage for this dataset. The dataset is updated on a daily basis (during weekdays). For the sake of uniformity and consistency of analyses, we used the accident data collected upto March 13, 2018.

Features: The dataset is 1.22 million rows and 29 columns (260 MB). The columns cover key aspects of every accident: Date, Time, Borough, Zip Code, Latitude, Longitude, Location, On Street Name, Cross Street Name, Number of Persons Injured, Number of Persons Killed, Number of Pedestrians Injured, Number of Pedestrians Killed, Number of Cyclist Injured, Number of Cyclist Killed, Number of Motorist Injured, Number of Motorist Killed etc.

library(tidyverse)
library(ggmap)
library(choroplethrZip)
accidents <- read_csv("NYPD_Motor_Vehicle_Collisions.csv")
names(accidents)
##  [1] "DATE"                          "TIME"                         
##  [3] "BOROUGH"                       "ZIP CODE"                     
##  [5] "LATITUDE"                      "LONGITUDE"                    
##  [7] "LOCATION"                      "ON STREET NAME"               
##  [9] "CROSS STREET NAME"             "OFF STREET NAME"              
## [11] "NUMBER OF PERSONS INJURED"     "NUMBER OF PERSONS KILLED"     
## [13] "NUMBER OF PEDESTRIANS INJURED" "NUMBER OF PEDESTRIANS KILLED" 
## [15] "NUMBER OF CYCLIST INJURED"     "NUMBER OF CYCLIST KILLED"     
## [17] "NUMBER OF MOTORIST INJURED"    "NUMBER OF MOTORIST KILLED"    
## [19] "CONTRIBUTING FACTOR VEHICLE 1" "CONTRIBUTING FACTOR VEHICLE 2"
## [21] "CONTRIBUTING FACTOR VEHICLE 3" "CONTRIBUTING FACTOR VEHICLE 4"
## [23] "CONTRIBUTING FACTOR VEHICLE 5" "UNIQUE KEY"                   
## [25] "VEHICLE TYPE CODE 1"           "VEHICLE TYPE CODE 2"          
## [27] "VEHICLE TYPE CODE 3"           "VEHICLE TYPE CODE 4"          
## [29] "VEHICLE TYPE CODE 5"
head(accidents)

Noteworthy Features: It is interesting how much detail NYPD has gone into while recording this information about each accident, so much that upto 5 vehicles contributing to an accident are noted, along with their corresponding vehicle codes; the street intersections are noted along with exact latitude and longitude information. They’ve also recorded extra information such as the number of cyclists and pedestrians killed/injured as a result of the collision/accident.

Analysis of Data Quality

We started with the analysis of missing values across the variables. Our first step was to use the skim function in the skimr package to get the percentage of missing values across the variables.

skimr::skim(accidents) %>% filter(stat == "missing") %>%
arrange(desc(value)) %>% select(variable, type, value) %>% mutate(percent = value/nrow(accidents))

We observe that variables like CONTRIBUTING FACTOR VEHICLE 5, VEHICLE TYPE CODE 5 etc. have the highest missing values - over 99%. These variables correspond to description of accidents involving as high as 5 vehicles, which would happen in very few cases.

It is noteworthy that variables like BOROUGH and ZIP CODE have about 28% missing values.

Since most of the observations do not have variables CONTRIBUTING FACTOR VEHICLE 5, VEHICLE TYPE CODE 5 etc, we decided to drop them. To avoid loss of useful information from this, we created a new binary variable which indicates if 3 or more vehicles were involved in an accident.

accidents <- accidents %>% mutate(VEHICLES_INVOLVED_GTE3 = 1 * !is.na(`CONTRIBUTING FACTOR VEHICLE 3`))
accidents <- accidents %>% select(-c(`VEHICLE TYPE CODE 5`,`CONTRIBUTING FACTOR VEHICLE 5`,
                                     `VEHICLE TYPE CODE 4`,`CONTRIBUTING FACTOR VEHICLE 4`,
                                     `VEHICLE TYPE CODE 3`,`CONTRIBUTING FACTOR VEHICLE 3`,
                                     `OFF STREET NAME`))

Our next step is to identify if the values are missing at random. The visna function in the extracat library is useful for that.

library(extracat)
visna(accidents)

We observe that several variables do not have any missing values. It is more informative to do the visna considering a subset of the variables that have missing values.

accidents_sub <- accidents %>% select(BOROUGH,`ZIP CODE`,LOCATION,
                                      `ON STREET NAME`, `CROSS STREET NAME`, 
                                      `CONTRIBUTING FACTOR VEHICLE 1`,`VEHICLE TYPE CODE 1`)
visna(accidents_sub)

We observe the following:

  1. Vehicle Types and Contributing Factor are a part of all missing patterns.
  2. Borough and ZIP Code tend to be missing together.
  3. On Street and Cross Street also tend to be missing simultaneously.

Based on these observations we can conclude that the variables are Missing at Random (MAR).

We wanted to see how the quality of the data has been changing over time. Our initial hypothesis for the missing values was that the data collection process might not have been great when the NYC Open Data portal started and may have improved over the years. However, the data suggests otherwise:

accidents <- accidents %>% 
  mutate(DATE = as.Date(DATE, format='%m/%d/%Y')) %>% 
  arrange(DATE) %>% 
  mutate(YEAR = format(DATE, "%Y")) %>%
  mutate(MONTH = format(DATE, "%m"))

accidents_missing <- accidents %>% 
  mutate(MISSING_BOROUGH=1*is.na(BOROUGH), 
         MISSING_CONTRIBUTION_INFO=1*is.na(`CONTRIBUTING FACTOR VEHICLE 1`)) %>%
  select(MISSING_BOROUGH, MISSING_CONTRIBUTION_INFO, YEAR, BOROUGH)

missing_borough_by_year <- accidents_missing %>%
  group_by(YEAR) %>% summarise(pc_missing_borough=mean(MISSING_BOROUGH))

missing_contrib_by_year <- accidents_missing %>%
  group_by(YEAR) %>% summarise(pc_missing_contrib=mean(MISSING_CONTRIBUTION_INFO))

missing_contrib_by_borough <- accidents_missing %>%
  group_by(BOROUGH) %>% summarise(pc_missing_contrib=mean(MISSING_CONTRIBUTION_INFO))
ggplot(data=missing_borough_by_year, aes(x=YEAR, y=pc_missing_borough)) + geom_col(fill="lightblue", color="black") +
  xlab("Year") + ylab("% Missing Borough Information") +
  ggtitle("% Missing Borough Information vs Year") +
  theme(plot.title = element_text(hjust = 0.5))

It appears that the quality of the data has been deteriorating over time. 2012 had only about 23% missing Borough information while 2017 was close to 40%.

Location - Where do most Accidents Happen?

Boroughs

Let’s first look at the broad level of the different NYC boroughs and see where most accidents happen:

ggplot(data=subset(accidents, !is.na(BOROUGH)), aes(BOROUGH)) + geom_bar(fill="lightblue", color="black") + xlab("Borough") + ylab('Count of Accidents') + ggtitle('Accidents by Borough (2012 - 2018)') + theme(plot.title = element_text(hjust = 0.5))

It appears that Brooklyn is the borough with the most number of accidents. This is not surprising as it is the most populated one. We can get a better understanding of this by looking at the ratio of annual accidents to the populations of these boroughs (Population data).

We also visualized Borough-Wise Distribution of Accidents from 2013-2018. This is an interactive D3 chart.

# Accidents per 100,000 in all Boroughs 

borough_per_1000 <- accidents %>% 
  group_by(`BOROUGH`) %>% 
  summarise(count_accidents = n()) %>% 
  filter(!is.na(`BOROUGH`)) %>%
  select(`BOROUGH`, count_accidents) %>% 
  arrange(desc(count_accidents))

pops <- c(1643734, 2629150, 1455720, 2333054, 476015)
boroughs <- c("MANHATTAN","BROOKLYN","BRONX","QUEENS","STATEN ISLAND")

pop_table <- data_frame(POPS = pops, BOROUGH = boroughs)

accidents_summary <- borough_per_1000 %>% left_join(pop_table, by="BOROUGH") %>% mutate(count_accidents = count_accidents/POPS *1000)

ggplot(data=accidents_summary, aes(x=BOROUGH, y=count_accidents)) + geom_col(fill="lightblue", color="black") + xlab("Borough") + ylab('Accidents per 1000 residents') + ggtitle('Accidents per 1000 residents by Borough') + 
  theme(plot.title = element_text(hjust = 0.5))

This chart reveals a different picture compared to the prior chart. Manhattan has a 25% higher number of accidents per 1000 residents compared to Brooklyn - making it more dangerous.

ZIP Codes

We looked at a more granular level of aggregation with ZIP codes to identify neighborhoods with the most accidents. These can be plotted on a map using the choroplethrZip library. This library is not available on CRAN and has to be downloaded from GitHub.

Installation Instructions:

  1. install.packages("devtools")
  2. library(devtools)
  3. install_github('arilamstein/choroplethrZip@v1.3.0')
library(choroplethrZip)
accidents_zip_other <- accidents %>% filter(!is.na(`ZIP CODE`)) %>% 
  group_by(`ZIP CODE`,BOROUGH) %>% 
  summarise(NUMBER_OF_ACCIDENTS=n())

names(accidents_zip_other) <- c("region", "BOROUGH","value") 
accidents_zip_other$region <- factor(accidents_zip_other$region)
accidents_zip_other <- accidents_zip_other[!duplicated(accidents_zip_other$region),]

zip_choropleth(accidents_zip_other,
               state_zoom = "new york",
               county_zoom = c(36005,36047,36061,36081,36085),
               title      = "Accidents by Zip Code in New York City",
               legend     = "Number of Accidents") + 
  theme(plot.title = element_text(hjust = 0.5, size=15)) + 
  viridis::scale_fill_viridis(name="# Accidents",discrete = TRUE) + coord_map()

We can make a few quick observations from the chart:

  1. Midtown Manhattan saw a large number of accidents per ZIP code - despite the fact that the neighborhoods are relatively small.
  2. Brooklyn also had a large number of ZIP codes with high accident rates.
  3. Large parts of Queens, Bronx, Staten Island and Uptown Manhattan had relatively fewer accidents.

Zooming in on Manhattan:

accidents_zip <- accidents %>% filter(!is.na(`ZIP CODE`)) %>% filter(BOROUGH=="MANHATTAN") %>% group_by(`ZIP CODE`) %>% summarise(NUMBER_OF_ACCIDENTS=n()) 
names(accidents_zip) <- c("region", "value")
accidents_zip$region <- factor(accidents_zip$region)

zip_choropleth(accidents_zip,
               state_zoom = "new york",
               county_zoom = 36061,
               title      = "Accidents by Zip Code in Manhattan",
               legend     = "Number of Accidents") +theme(plot.title = element_text(hjust = 0.5, size=15))+viridis::scale_fill_viridis(name="# Accidents",discrete = TRUE)+coord_map()

We also looked at the ZIP codes with the highest fatalities.

accidents_zip_deaths <- accidents %>% filter(!is.na(`ZIP CODE`)) %>% group_by(`ZIP CODE`,BOROUGH) %>% summarise(NUMBER_OF_DEATHS = sum(`NUMBER OF PERSONS KILLED`))

names(accidents_zip_deaths) <- c("region", "BOROUGH","value") 
accidents_zip_deaths$region <- factor(accidents_zip_deaths$region)
accidents_zip_deaths <- accidents_zip_deaths[!duplicated(accidents_zip_deaths$region),]

zip_choropleth(accidents_zip_deaths,
               state_zoom = "new york",
               county_zoom = c(36005,36047,36061,36081,36085),
               title      = "Deaths by Zip Code in New York City",
               legend     = "Number of Deaths") +theme(plot.title = element_text(hjust = 0.5, size=15))+viridis::scale_fill_viridis(name="# Deaths",discrete = TRUE)+coord_map()

The ZIP codes which have the highest accidents are the following:

zip_count <- accidents %>% 
  group_by(`ZIP CODE`) %>% 
  summarise(count_accidents = n()) %>% 
  filter(!is.na(`ZIP CODE`)) %>%
  select(`ZIP CODE`, count_accidents) %>% 
  arrange(desc(count_accidents)) %>%
  mutate(`ZIP CODE` = as.character(`ZIP CODE`))

zip_count_top10 <- zip_count[1:10,]

dotchart(zip_count_top10$count_accidents, labels = zip_count_top10$`ZIP CODE`,
         pch=19, xlab = "Accident Count", groups = -zip_count_top10$count_accidents, 
         xlim=c(0,16000), ylab = "ZIP code",
         main = "Accidents - Top 10 ZIP codes")

We also looked at the top 10 ZIP codes for fatalities:

zip_count_deaths <- accidents %>% 
  group_by(`ZIP CODE`) %>% 
  summarise(count_deaths = sum(`NUMBER OF PERSONS KILLED`)) %>% 
  filter(!is.na(`ZIP CODE`)) %>%
  select(`ZIP CODE`, count_deaths) %>% 
  arrange(desc(count_deaths)) %>%
  mutate(`ZIP CODE` = as.character(`ZIP CODE`))

zip_count_deaths_top10 <- zip_count_deaths[1:10,]

dotchart(zip_count_deaths_top10$count_deaths, labels = zip_count_deaths_top10$`ZIP CODE`,
         pch=19, xlab = "Deaths", xlim=c(0,30),  main = "Deaths - Top 10 ZIP codes", ylab = "ZIP code",
         groups = -zip_count_deaths_top10$count_deaths)

To get a more granular view of where most accidents happen, we superimposed a heatmap on top of the NYC map to identify specific roads and interesctions with most accidents. The NYC map was obtained using the handy ggmap library.

nycmap <- qmap("new york", zoom = 11, color = "bw")
nycmap + stat_bin2d(aes(x = LONGITUDE, y = LATITUDE), size = .5, bins = 150, 
                    alpha = 0.5, data = accidents) + viridis::scale_fill_viridis() + 
  ggtitle("Accident Density by Location") + theme(plot.title = element_text(hjust = 0.5))

The heatmap is lit up brightly for midtown and lower Manhattan. There are also hotspots in Brooklyn (along Atlantic Avenue) which are brightly lit. This chart illustrates why the accidents per 1000 residents is the highest for Manhattan.

Verifying Vision Zero

“Vision Zero is a safe-streets initiative created by New York City’s Department of Transportation (DOT) in 2014, the first year of Mayor Bill de Blasio’s administration. The Vision Zero concept and program emerged in the late 1990s in Sweden, which aimed to eliminate traffic deaths and serious injuries. A number of other countries and cities subsequently adopted similar traffic safety programs; and safe-street improvements were undertaken under the administration of New York’s previous mayor, Michael Bloomberg,” reports a Nacto report from 2018.. “In NYC, Vision Zero consists of reengineering intersections and streets: it simplifies complex intersections, narrows traffic lanes, adds speed bumps and bicycle paths, shortens pedestrian-crossing distances, alters the timing of traffic lights, adds speed-detection and red-light cameras, and installs “leading pedestrian intervals” to give pedestrians a head start at a light before drivers can turn into the crosswalk,” the report further adds.

“In January 2014, Mayor Bill de Blasio announced adoption of New York City’s Vision Zero and enumerated a long list of initiatives the city would be following to reduce fatalities on city streets. Among the measures it plans to take includes pushing for changes in the State legislature to allow the city more control in the administration of traffic safety measures such as speed reduction,” writes the Wikipedia on Vision Zero.

“According to a OneNYC report from 2017, the Department of Transportation (DOT) has completed a total of 356 street improvement projects, since the start of Vision Zero, with an emphasis on the city’s most crash-prone corridors and intersections. The Great Streets initiative focuses on improving safety on four of New York City’s major arterial routes: Atlantic Avenue, Fourth Avenue, Grand Concourse, and Queens Boulevard,” reports OneNYC.

We decided to look into the accident trends on Atlantic Avenue and Queens Boulevard and see if Vision Zero had an impact.

accidents_queens <- accidents %>% filter(`ON STREET NAME`=='QUEENS BOULEVARD') %>% group_by(`YEAR`) %>% summarise(`Number of Accidents`=n()) %>% 
  mutate(YEAR = as.numeric(YEAR))

ggplot(accidents_queens, aes(x=YEAR, y=`Number of Accidents`)) + geom_point() +geom_smooth(method = "loess", se=FALSE) + ggtitle("Number of Accidents by Year across Queens Boulevard") + theme(plot.title = element_text(hjust = 0.5))+xlab("Year")

accidents_atlav <- accidents %>% filter(`ON STREET NAME`=='ATLANTIC AVENUE') %>% group_by(`YEAR`) %>% summarise(`Number of Accidents`=n())%>% mutate(YEAR = as.numeric(YEAR))

ggplot(accidents_atlav, aes(x=YEAR, y=`Number of Accidents`)) + geom_point() +geom_smooth(method = "loess", se=FALSE) + ggtitle("Number of Accidents by Year across Atlantic Avenue") + theme(plot.title = element_text(hjust = 0.5))+xlab("Year")

We can see that the number of accidents across two of the most accident-prone streets, Atlantic Avenue and Queens Boulevard, has gone down considerably since the introduction of Vision Zero in 2014.

Time: When do Accidents Happen?

We tried to understand the trend in accidents over time. We started by looking at daily accidents over time (over the span of 5 years). This is obviously expected to be noisy, so we added a trend curve using the non-parametric loess smoothing.

accidents_per_day <- accidents %>%
  group_by(DATE, BOROUGH) %>%
  summarise(NUMBER_OF_ACCIDENTS=n(), 
            NUMBER_OF_DEATHS = sum(`NUMBER OF PERSONS KILLED`), 
            NUMBER_OF_INJURIES = sum(`NUMBER OF PERSONS INJURED`)) %>%
  filter(NUMBER_OF_ACCIDENTS<400) # Remove one outlier point 


ggplot(accidents_per_day, aes(x=DATE, y=NUMBER_OF_ACCIDENTS)) + geom_line(aes(colour="Accidents")) +
  geom_smooth(method = "loess", span = .2, se = TRUE, show.legend = TRUE, aes(colour="Trend")) +
  scale_colour_manual(name="Legend", values=c("black", "red")) +
  xlab("Time") + ylab("Number of Daily Accidents") +
  ggtitle("Daily Accidents") +
  theme(plot.title = element_text(hjust = 0.5))

There are a few quick observations we can make from this chart:

  1. There seems to be an abrupt jump in the number of days with more than 200 accidents sometime around mid 2016. We believe that this might be due to data collection issues rather than a systematic change in accident trends.
  2. The trend curve has a very gradual overall increase with time.
  3. The trend curve seems to be oscillating - indicating a seasonality in the data.

Accidents by Month

To better understand the seasonality aspect of the data, we looked at the number of accidents aggregated over a month level.

accidents_by_month <- accidents %>% filter(YEAR > 2012) %>% filter(YEAR < 2018) %>%
   group_by(MONTH) %>% summarise(num_accidents = n())
  
ggplot(accidents_by_month, aes(x=MONTH, y=num_accidents)) +
  geom_bar(stat="identity", fill="lightblue", color="black")+
  xlab("Month") + ylab("Total Accidents") +
  ggtitle("Accidents by month") +
  theme(plot.title = element_text(hjust = 0.5))

The chart shows a dip in the accidents for the first few months of an year. There seem to be fewer accidents in winter. We also tried to plot the same information in a polar coordinate plot to give a better sense of the seasonality.

ggplot(accidents_by_month, aes(x=MONTH, y=num_accidents)) +
  geom_bar(stat="identity", fill="lightblue", color="black")+
  xlab("Month") + ylab("Total Accidents") +
  ggtitle("Accidents by month") +
  theme(plot.title = element_text(hjust = 0.5)) +
  coord_polar(start=0)

Accidents by Time of Day

Besides time of year, we were interested in looking at trends by time of day. We aggregated the data by the hour and looked at the accidents. Like before, we made both the cartesian bar charts as well as the polar bar charts.

accidents_by_timeofday <- accidents %>%
  mutate(HOUR = format(as.POSIXct(TIME), "%H")) %>%
  group_by(HOUR, BOROUGH) %>%
  summarise(NUMBER_OF_ACCIDENTS=n(), 
            NUMBER_OF_DEATHS = sum(`NUMBER OF PERSONS KILLED`), 
            NUMBER_OF_INJURIES = sum(`NUMBER OF PERSONS INJURED`))# %>%
#  filter(NUMBER_OF_ACCIDENTS<400) # Remove one outlier point 


ggplot(accidents_by_timeofday, aes(x=HOUR, weights=NUMBER_OF_ACCIDENTS)) +
  geom_bar(stat="count", fill="lightblue", color="black") +
  xlab("Time") + ylab("Accidents") +
  ggtitle("Accidents by Hour") +
  theme(plot.title = element_text(hjust = 0.5))

ggplot(accidents_by_timeofday, aes(x=HOUR, weights=NUMBER_OF_ACCIDENTS)) +
  geom_bar( stat="count", fill="lightblue", color="black") +
  xlab("Time") + ylab("Accidents") +
  ggtitle("Accidents by Hour") +
  theme(plot.title = element_text(hjust = 0.5)) +
  coord_polar()

The frequency of accidents peak once during the morning (around 9 am) and once in the evening (around 5 pm). This is probably because these are the peak hours of traffic. Also the number of accidents in the peak hours of the evening is much higher than the number of accidents in the morning hours.

On Design Choices…

The above chart is a polar coordinate representation of the bar chart preceding it. The Cartesian bar chart is significantly easier to read and draw inferences from. The differneces in values are linear and intuitive in the Cartesian chart. However it fails to capture the cyclical nature of months (December looping over to January) and time of day (23 to 0) which is better captured by the Polar bar chart.

Contributing Factors for Accidents

We analyzed the different contributing factors leading to accidents over the years. This is captured by the CONTRIBUTING FACTOR VEHICLE 1 and CONTRIBUTING FACTOR VEHICLE 2 variables. They correspond to the reason for Vehicle 1 and Vehicle 2 respectively. These variables have a large number of missing values:

print("Number of missing values in `CONTRIBUTING FACTOR VEHICLE 1`")
## [1] "Number of missing values in `CONTRIBUTING FACTOR VEHICLE 1`"
print(sum(is.na(accidents$`CONTRIBUTING FACTOR VEHICLE 1`)))
## [1] 6297
print("Number of missing values in `CONTRIBUTING FACTOR VEHICLE 2`")
## [1] "Number of missing values in `CONTRIBUTING FACTOR VEHICLE 2`"
print(sum(is.na(accidents$`CONTRIBUTING FACTOR VEHICLE 2`)))
## [1] 170620

CONTRIBUTING FACTOR VEHICLE 1 has much fewer missing values than CONTRIBUTING FACTOR VEHICLE 2. Analyzing them separately for the purpose of better analyses:

checkContrFact1_2 <- accidents %>% 
  filter(!is.na(`CONTRIBUTING FACTOR VEHICLE 1`) & !is.na(`CONTRIBUTING FACTOR VEHICLE 2`)) %>% 
  filter(`CONTRIBUTING FACTOR VEHICLE 1` != "Unspecified" & `CONTRIBUTING FACTOR VEHICLE 2` != "Unspecified") %>%
  mutate(sameValue = `CONTRIBUTING FACTOR VEHICLE 1`==`CONTRIBUTING FACTOR VEHICLE 2`) 

print("Number of records where `CONTRIBUTING FACTOR VEHICLE 1` equals `CONTRIBUTING FACTOR VEHICLE 2`:")
## [1] "Number of records where `CONTRIBUTING FACTOR VEHICLE 1` equals `CONTRIBUTING FACTOR VEHICLE 2`:"
print(sum(checkContrFact1_2$sameValue))
## [1] 98407

Upon analyzing CONTRIBUTING FACTOR VEHICLE 1 and CONTRIBUTING FACTOR VEHICLE 2 columns after omitting records containing NAs and only considering records where both CONTRIBUTING FACTOR VEHICLE 1 and CONTRIBUTING FACTOR VEHICLE 2 are specified, we observe that 98407 records out of 160161 have the same entry for these two columns.

We plot the top 10 contributing factors for Vehicle 1.

accidents_by_reason_1 <- accidents %>% filter(!is.na(`CONTRIBUTING FACTOR VEHICLE 1`)) %>% 
  filter(`CONTRIBUTING FACTOR VEHICLE 1` != "Unspecified") %>% 
  group_by(`CONTRIBUTING FACTOR VEHICLE 1`) %>%
  summarise(NUMBER_OF_ACCIDENTS=n(), 
            NUMBER_OF_DEATHS = sum(`NUMBER OF PERSONS KILLED`)) %>%
  arrange(desc(NUMBER_OF_ACCIDENTS))

accidents_by_reason_1 <- head(accidents_by_reason_1, n = 10)

ggplot(accidents_by_reason_1, aes(x=reorder(`CONTRIBUTING FACTOR VEHICLE 1`, `NUMBER_OF_ACCIDENTS`), 
                                  y=`NUMBER_OF_ACCIDENTS`)) + 
  geom_col(fill="lightblue", color="black") + 
  xlab("Contributing Factor Vehicle 1") +
  ylab("Number of Accidents")+
  ggtitle("Top 10 First Contributing Factors by Accident Count")+
  theme(plot.title = element_text(hjust = 0.5))+coord_flip()

accidents_by_reason_2 <- accidents %>% filter(!is.na(`CONTRIBUTING FACTOR VEHICLE 2`)) %>% 
  filter(`CONTRIBUTING FACTOR VEHICLE 2` != "Unspecified") %>% 
  group_by(`CONTRIBUTING FACTOR VEHICLE 2`) %>%
  summarise(NUMBER_OF_ACCIDENTS=n(), 
            NUMBER_OF_DEATHS = sum(`NUMBER OF PERSONS KILLED`)) %>%
  arrange(desc(NUMBER_OF_ACCIDENTS))

accidents_by_reason_2 <- head(accidents_by_reason_2, n = 10)

ggplot(accidents_by_reason_2, aes(x=reorder(`CONTRIBUTING FACTOR VEHICLE 2`, `NUMBER_OF_ACCIDENTS`), 
                                  y=`NUMBER_OF_ACCIDENTS`)) +
  geom_col(fill="lightblue", color="black") + 
  xlab("Contributing Factor Vehicle 2")+ylab("Number of Accidents")+
  ggtitle("Top 10 Second Contributing Factors by Accident Count")+
  theme(plot.title = element_text(hjust = 0.5))+coord_flip()

We used a heatmap to understand the combination of contributing factors for Vehicle 1 and contributing factors for Vehicle 2 resulting in most accidents.

accidents_by_vehicles <- accidents %>% 
  filter(!is.na(`VEHICLE TYPE CODE 1`) & !is.na(`VEHICLE TYPE CODE 2`)) %>% 
  filter(`VEHICLE TYPE CODE 1` != "UNKNOWN" & `VEHICLE TYPE CODE 2` != "UNKNOWN") %>% 
  group_by(`VEHICLE TYPE CODE 1`,`VEHICLE TYPE CODE 2`) %>%
  summarise(NUMBER_OF_ACCIDENTS=n(), 
            NUMBER_OF_DEATHS = sum(`NUMBER OF PERSONS KILLED`)) %>%
  arrange(desc(NUMBER_OF_ACCIDENTS))

accidents_by_vehicles <- head(accidents_by_vehicles, n = 50)

ggplot(accidents_by_vehicles, aes(x = `VEHICLE TYPE CODE 1`, y = `VEHICLE TYPE CODE 2`, fill = NUMBER_OF_ACCIDENTS)) +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  geom_tile()+xlab("First Vehicle Type")+ylab("Second Vehicle Type")+
  ggtitle("Number of Accidents by combinations of Vehicle Types")+
  theme(plot.title = element_text(size=18, hjust = 0.5))+
  viridis::scale_fill_viridis()

We observe that most accidents occur when one or more drivers of the vehicles involved were distracted/inattentive.

Contributing Factors and Time

We looked at Time Variation of Contributing Factors from 2013-2017. This is an interactive D3 chart.

Contributing Factors and Locations

We wanted to observe which locations in NYC were more likely to see the particular contributing factors we noted above. For this we focused only on Manhattan to get a closer view. We also looked only at the top 5 contributing factors.

accidents_by_reasons <- accidents %>% 
  filter(!is.na(`CONTRIBUTING FACTOR VEHICLE 1`) & !is.na(`CONTRIBUTING FACTOR VEHICLE 2`)) %>% 
  filter(`CONTRIBUTING FACTOR VEHICLE 1` != "Unspecified" & `CONTRIBUTING FACTOR VEHICLE 2` != "Unspecified") %>%
  group_by(`CONTRIBUTING FACTOR VEHICLE 1`,`CONTRIBUTING FACTOR VEHICLE 2`) %>%
  summarise(NUMBER_OF_ACCIDENTS=n(), 
            NUMBER_OF_DEATHS = sum(`NUMBER OF PERSONS KILLED`)) %>%
  arrange(desc(NUMBER_OF_ACCIDENTS))

accidents_by_reasons_map <- head(accidents_by_reasons, n=5)
accidents_topCF12 <- accidents %>%
  filter(`CONTRIBUTING FACTOR VEHICLE 1` %in% accidents_by_reasons_map$`CONTRIBUTING FACTOR VEHICLE 1` & `CONTRIBUTING FACTOR VEHICLE 2` %in% accidents_by_reasons_map$`CONTRIBUTING FACTOR VEHICLE 2`) %>% 
  filter(!is.na(LATITUDE) & !is.na(LONGITUDE)) %>% filter(BOROUGH=='MANHATTAN')

qmap('manhattan', zoom = 12,color = 'bw', legend = 'topleft') + 
  geom_point(aes(x=LONGITUDE, y=LATITUDE, color=`CONTRIBUTING FACTOR VEHICLE 1`), 
             data=accidents_topCF12,alpha=0.6, size=1.5) +
  viridis::scale_color_viridis(discrete=TRUE)+
  ggtitle("Spatial View of Accidents by Top 5 Contributing Factors in Manhattan")+
  theme(plot.title = element_text(hjust = 0.5, size=15))

We observe an unusual pattern here - the different contributing factors have different “hotspots” across Manhattan. This is better illustrated by using 2D density plots faceted on contributing factor.

qmap('manhattan', zoom = 13,color = 'bw', legend = 'topleft') + 
  geom_point(aes(x=LONGITUDE, y=LATITUDE, color=`NUMBER OF PERSONS KILLED`), 
             data=accidents_topCF12,alpha=0.7, size=1.0) + viridis::scale_color_viridis()+ggtitle("Spatial View of Accidents by Top 5 Contributing Factors in Manhattan")+
  theme(plot.title = element_text(hjust = 0.5, size=28))+
  geom_density_2d(aes(x=LONGITUDE, y=LATITUDE, color=`NUMBER OF PERSONS KILLED`), 
                  data=accidents_topCF12, color="red")+
  facet_wrap(~ `CONTRIBUTING FACTOR VEHICLE 1`, nrow=3)+
  theme(panel.spacing.x=unit(2, "lines"),panel.spacing.y=unit(2, "lines"))+
  theme(strip.text.x = element_text(size = 25))

Quick Observations:

  1. Driver inattention seems to be spread out across Manhattan. There are multiple hotspots for it.
  2. The hotspot for “Failure to yield right of way” seems to be around the Empire State Building.
  3. The hotspot for “Fatigued/Drowsy driver” is around the Trump Tower on 3rd Avenue.
  4. There seem to be a lot of accidents related to “Lost Consciousness” around Times Square.
  5. The “Other Vehicular” is concentrated on the East side of town.

Vehicles Involved

Finally we analyzed the vehicles usually involved in accidents and tried to look for patterns there:

accidents_by_vehicles_new <- head(accidents_by_vehicles, n=10)
accidents_by_reasons_new <- head(accidents_by_reasons, n=10)

accidents_cf_vt <- accidents %>% 
  filter(!is.na(`VEHICLE TYPE CODE 1`) & !is.na(`CONTRIBUTING FACTOR VEHICLE 1`)) %>%  
  filter(`VEHICLE TYPE CODE 1` != "UNKNOWN" & `CONTRIBUTING FACTOR VEHICLE 1`!= "Unspecified") %>% filter(`VEHICLE TYPE CODE 1` %in% accidents_by_vehicles_new$`VEHICLE TYPE CODE 1`) %>% filter(`CONTRIBUTING FACTOR VEHICLE 1` %in% accidents_by_reasons_new$`CONTRIBUTING FACTOR VEHICLE 1`) %>% group_by(`VEHICLE TYPE CODE 1`, `CONTRIBUTING FACTOR VEHICLE 1`) %>%
  summarise(NUMBER_OF_ACCIDENTS=n(), 
            NUMBER_OF_DEATHS = sum(`NUMBER OF PERSONS KILLED`)) %>%
  arrange(desc(NUMBER_OF_ACCIDENTS))
head(accidents_cf_vt,n=10)
ggplot(accidents_cf_vt, aes(y = `VEHICLE TYPE CODE 1`, x = `CONTRIBUTING FACTOR VEHICLE 1`, fill = NUMBER_OF_ACCIDENTS)) +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))+
  geom_tile() + xlab("Contributing Factor Vehicle 1") + 
  ylab("Vehicle Type 1")+
  ggtitle("# Accidents by First Vehicle Types and Contributing Factors")+
  theme(plot.title = element_text(hjust = 0.5))+viridis::scale_fill_viridis()

accidents_cf_vt2 <- accidents %>% 
  filter(!is.na(`VEHICLE TYPE CODE 2`) & !is.na(`CONTRIBUTING FACTOR VEHICLE 2`)) %>%  
  filter(`VEHICLE TYPE CODE 2` != "UNKNOWN" & `CONTRIBUTING FACTOR VEHICLE 2`!= "Unspecified") %>% filter(`VEHICLE TYPE CODE 2` %in% accidents_by_vehicles_new$`VEHICLE TYPE CODE 2`) %>% filter(`CONTRIBUTING FACTOR VEHICLE 2` %in% accidents_by_reasons_new$`CONTRIBUTING FACTOR VEHICLE 2`) %>% group_by(`VEHICLE TYPE CODE 2`, `CONTRIBUTING FACTOR VEHICLE 2`) %>%
  summarise(NUMBER_OF_ACCIDENTS=n(), 
            NUMBER_OF_DEATHS = sum(`NUMBER OF PERSONS KILLED`)) %>%
  arrange(desc(NUMBER_OF_ACCIDENTS))
head(accidents_cf_vt2,n=10)
ggplot(accidents_cf_vt2, aes(y = `VEHICLE TYPE CODE 2`, x = `CONTRIBUTING FACTOR VEHICLE 2`, fill = NUMBER_OF_ACCIDENTS)) +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  geom_tile()+xlab("Contributing Factor Vehicle 2") +
  ylab("Vehicle Type 2")+
  ggtitle("# Accidents by Second Vehicle Types and Contributing Factors")+
  theme(plot.title = element_text(hjust = 0.5))+
  viridis::scale_fill_viridis()

Potpourri: Around the Columbia Neighborhood

We filtered our dataset to look at accidents in the Columbia neighborhood. In particular, we are interested in accidents in the year 2017 for ZIP codes 10027 (Morningside Heights) and 10025 (Upper West Side).

accidents_cu <- accidents %>% 
  filter(`ZIP CODE`== 10027 | `ZIP CODE`== 10025) %>% 
  filter(!is.na(LOCATION)) %>% 
  filter(YEAR==2017)

The ggmap library has the amazing support to just let us use “Columbia University” as a keyword input to fetch the map of the neighborhood. We then use a scatterplot (with alpha blending) of all the accidents and use a 2D density plot to focus on the hotspots.

library(ggmap)
require(gridExtra)

p <- qmap(location = "columbia university", zoom = 14) + theme(plot.margin=unit(c(0.1,0.1,0.1,0.1), "cm")) +
  ggtitle("Columbia University Neighborhood") +
  theme(plot.title = element_text(hjust = 0.5))
q <- p + 
  geom_point(data=accidents_cu, aes(x=LONGITUDE, y=LATITUDE), alpha=0.1, colour="blue") +
  geom_density_2d(data=accidents_cu, aes(x=LONGITUDE, y=LATITUDE)) + 
  ggtitle("Road Accident Hotspots") +
  theme(plot.title = element_text(hjust = 0.5))

grid.arrange(p, q, ncol=2)

There are two major hotspots revealed by the density plot -

  1. 96th St and Broadway/Amsterdam Av.
  2. Harlem 125th St and Adam Clayton Powell Jr Av/ Fredric Douglas Blvd.

These are both localities where a lot of grad students live and commute and they had the highest number of accidents in 2017. Other places with a large number of accidents are 110th St and Broadway, 125th St and Broadway, 125th St and Amsterdam etc. All intersections on 96th St and 125th St have a large number of accidents.

Conclusions:

Executive Summary

Introduction

“More New Yorkers are now killed in traffic than murdered by guns. A New Yorker is killed in a crash every 33 hours”, writes Transportation Alternatives. The statistics on traffic accidents in New York City are staggering - there were nearly 142,000 vehicle accidents in the city in 2017 alone. The question, therefore, becomes: Can we derive insights to help make the streets of NYC safer?

We decided to dig deeper into the accidents that take place on the streets of New York City and try and answer some critical questions such as: where do most accidents occur, what are the major reasons behind these accidents and what are the most common times of the day, month and year. We got some interesting insights from our analysis which we have presented in this report.

Our team decided to use the NYPD Motor Vehicle Collisions available as an NYC Open Data source to perform the analysis. The dataset contains 1.22 million rows and 29 columns with information like the date, time and location of the accident along with the number of casualties.

We used R and D3.js to perform all exploratory data analysis and visualization.

Insights from Analyses:

We decided to break down our analysis into three major parts: * Location * Time * Contributing Factors

The following graphs represent these three broad spectrums of analyses well.

1. Choropleth: Location by Zip Code

The choropleth shows the spread of accidents across New York aggregated by Zip code. Regions in Manhattan and Brooklyn experience the highest number of accidents, while Queens and Staten Island have the lowest. It can also be seen that the number of accidents taking place in the peripheral regions of the city is much lower than the number of accidents in the central part of the city, where traffic is more likely to be concentrated. Also, the midtown region of Manhattan experiences more accidents as compared to the other parts of Manhattan.

2. Bar Chart: Time by Hour of Day

The frequency of accidents peaks once during the morning (around 9 am) and once in the evening (around 5 pm). This is probably because these are the rush hours of traffic. Also the number of accidents in the evening rush hours is much higher than the number of accidents in the morning rush hours.

3. Scatter Plot: Spatial Distribution of Contributing Factors Across Manhattan

This chart shows the spread of the reasons leading to accidents across different locations in Manhattan. The Midtown East and the Lower East Side regions have a high concentration of accidents caused by driver inattention/distraction. West and midtown area has a high number of accidents due to fatigued/drowsy drivers or due to drivers losing consciousness.

We also created two interactive D3 visualizations:

Conclusions/Suggestions:

  • The increase in the missing borough information indicates requirement of improvement in data collection sources.
  • People should drive more carefully during the rush hours (8 am - 9 am and 4 pm - 6 pm).
  • Better traffic control strategies around Midtown Manhattan and parts of Brooklyn might improve the traffic situation in New York City.
  • We verified that Government initiatives like Vision Zero are effective in reducing accidents. The Government should continue taking such initiatives.